ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088
ARROW-9992: [C++][Python] Refactor python to arrow conversions based on a reusable conversion API #8088kszucs wants to merge 80 commits into
Conversation
|
Are there some related JIRAs that will be fixed by this? |
|
Not directly, but this will help to solve multiple ones. I'm going to create one. |
|
Just executed the ASV benchmarks on this PR and the master branch with the two sample sizes and the results showed slighty better performance in both cases: Also compared the So far there is no performance degradation - if we can trust the ASV benchmarks. |
pitrou
left a comment
There was a problem hiding this comment.
Why separate PyArrayConverter and ArrayConverter?
My hope is that the classes in |
bkietz
left a comment
There was a problem hiding this comment.
This looks like a neat cleanup of the conversion code, and potentially reusable as well. Thanks for doing this! I have a few comments on the overall structure below
|
@romainfrancois since the conversion code in Could you please take a look at the base classes in converter.h to see whether would it help/improve the R side implementation? I have not documented it yet because it's still under discussion but you can take a look how I'm using it in the python bindings. |
|
@github-actions crossbow submit test-spark |
|
Revision: 79e8229d6828f7f63f80db75a4ef89af9deb03d3 Submitted crossbow builds: ursa-labs/crossbow @ actions-541
|
|
@ursabot build |
bkietz
left a comment
There was a problem hiding this comment.
Chunker is neat, thanks for working on this
My main concern here is overcomplexity of the Converter<> interface, since it currently bears responsibilities both as the abstract base class for conversion and as a trait mapping from DataType subclasses to concrete converters (used to enable generic construction).
I think Converter<>, Chunker<> and the mixins PrimitiveConverter<> etc can remain in util/converter.h but costruction logic (MakeConverterImpl) should be moved back to python_to_arrow.h. Fully generic construction is definitely a worthwhile goal but I also think it's worth doing so in a dedicated follow up. Establishing the base classes for generic conversion is already a great step forward
33c5aa7 to
f3be748
Compare
bkietz
left a comment
There was a problem hiding this comment.
This is looking much better, just a few more simplifications to construction logic and I think this will be ready to go
There was a problem hiding this comment.
The converter's Reserve() might fail due to overflow, but this shouldn't be an error for Chunker::Reserve since an overflow can be averted by finishing the chunk. See for example the logic surrounding ChunkedBinaryBuilder::extra_capacity_. This doesn't need to be handled in this PR since Reserve is a performance hint; for now I think we can just ignore errors emitted by this call to Reserve
jorisvandenbossche
left a comment
There was a problem hiding this comment.
(only briefly looked at the python code + tests, impressive set of changes ;-))
There was a problem hiding this comment.
It doesn't raise on "d" being present here?
There was a problem hiding this comment.
It tolerates extra items (similarly when converting from dicts). I can restrict it if you think that would be more appropiate. We may want to alter this behavior with specific conversion flags.
There was a problem hiding this comment.
Since the order matters when constructing from pairs (based on the test below), I would personally expect additional elements would also not be allowed (BTW, what actually happens if the additional element is not in the last position? Maybe add a test for that as well?).
While when constructing from dicts also the order does not matter I suppose? So this could be more tolerant
There was a problem hiding this comment.
It will raise since the key field equality is checked explicitly. It also supports converting less elements than the number of fields.
Targets of the refactoring:
Resultreturn values and containis_utf8flagFixes
ARROW-9997 [Python] StructScalar.as_py() fails if the type has duplicate field namesBackward incompatibility
Since a struct type can contain duplicated field names we cannot return a struct scalar as a mapping, so I had to change the.as_py()representation to return with a list of key-value pairs.TODOs:
Library size
Before:
After:
Benchmarks
Executed the following ASV benchmark:
After some optimization: